Unsupervised Pre-Training of Image Features on Non-Curated Data
Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using non-curated raw datasets was found to decrease feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that end, we propose a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data. We validate our approach on 96 million images from YFCC100M [42], achieving state-of-the-art results among unsupervised methods on standard benchmarks, which confirms the potential of unsupervised learning when only non-curated raw data are available. We also show that pre-training a supervised VGG-16 with our method achieves 74.9% top-1 classification accuracy on the validation set of ImageNet, an improvement of +0.8% over the same network trained from scratch. Our code is available at https://github.com/facebookresearch/DeeperCluster
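The approach described above combines clustering pseudo-labels with a self-supervision signal. A minimal sketch of that idea, assuming toy k-means pseudo-labels combined with RotNet-style rotation labels; the function names and the flat joint target are illustrative simplifications, not the paper's hierarchical, large-scale procedure:

```python
import numpy as np

def pseudo_labels(features, k, iters=10, seed=0):
    """Toy k-means over feature vectors; returns a cluster id per sample.
    In a clustering-based setup, these ids serve as classification targets."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = features[assign == c].mean(axis=0)
    return assign

def joint_target(cluster_id, rotation_id, n_rotations=4):
    """Combine a clustering pseudo-label with a self-supervised rotation
    label (0/90/180/270 degrees) into one joint class index."""
    return cluster_id * n_rotations + rotation_id

feats = np.random.default_rng(1).normal(size=(100, 16))
labels = pseudo_labels(feats, k=5)
target = joint_target(labels[0], rotation_id=2)
```

A network would then be trained to predict these joint targets, so the clustering and rotation signals regularize each other.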
Pruning Convolutional Neural Networks with Self-Supervision
Convolutional neural networks trained without supervision come close to
matching the performance of supervised pre-training, but sometimes at the cost
of an even higher number of parameters. Extracting subnetworks from these large
unsupervised convnets with preserved performance is of particular interest to
make them less computationally intensive. Typical pruning methods operate
during training on a task while trying to maintain the performance of the
pruned network on the same task. However, in self-supervised feature learning,
the training objective is agnostic to how well the representation transfers to
downstream tasks. Thus, preserving performance on this objective does not
ensure that the pruned subnetwork remains effective for solving downstream
tasks. In this work, we investigate the use of standard pruning methods,
developed primarily for supervised learning, for networks trained without
labels (i.e. on self-supervised tasks). We show that pruned masks obtained with
or without labels reach comparable performance when re-trained on labels,
suggesting that pruning operates similarly for self-supervised and supervised
learning. Interestingly, we also find that pruning preserves the transfer
performance of self-supervised subnetwork representations
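The "standard pruning methods" the abstract refers to can be illustrated with a minimal global magnitude-pruning mask. This is a generic sketch of the family of techniques, not the paper's exact procedure:

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Global magnitude pruning: zero out the `sparsity` fraction of
    weights with smallest absolute value, returning a binary mask.
    Such masks can be computed on a self-supervised or supervised
    checkpoint and the surviving weights re-trained with labels."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return np.ones_like(weights)
    # k-th smallest magnitude acts as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)
```

Comparing such masks computed with and without labels, and then re-training on labels, is the experiment the abstract describes.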
A Memory Transformer Network for Incremental Learning
We study class-incremental learning, a training setup in which new classes of
data are observed over time for the model to learn from. Despite the
straightforward problem formulation, the naive application of classification
models to class-incremental learning results in the "catastrophic forgetting"
of previously seen classes. One of the most successful existing methods has
been the use of a memory of exemplars, which overcomes the issue of
catastrophic forgetting by saving a subset of past data into a memory bank and
utilizing it to prevent forgetting when training future tasks. In our paper, we
propose to enhance the utilization of this memory bank: we not only use it as a
source of additional training data like existing works but also integrate it in
the prediction process explicitly. Our method, the Memory Transformer Network
(MTN), learns how to combine and aggregate the information from the nearest
neighbors in the memory with a transformer to make more accurate predictions.
We conduct extensive experiments and ablations to evaluate our approach. We
show that MTN achieves state-of-the-art performance on the challenging
ImageNet-1k and Google-Landmarks-1k incremental learning benchmarks
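A rough sketch of exemplar-memory inference as described above. Note that it substitutes simple softmax attention over the k nearest exemplars for MTN's learned transformer, so the function name, the attention rule, and the temperature are illustrative assumptions:

```python
import numpy as np

def memory_prediction(query, memory_feats, memory_labels, n_classes, k=5, tau=0.1):
    """Predict class scores for one query embedding by attending over its
    k nearest exemplars in the memory bank (cosine similarity)."""
    sims = memory_feats @ query / (
        np.linalg.norm(memory_feats, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:k]               # k nearest exemplars
    attn = np.exp(sims[top] / tau)
    attn /= attn.sum()                        # softmax attention weights
    scores = np.zeros(n_classes)
    for w, lbl in zip(attn, memory_labels[top]):
        scores[lbl] += w                      # weighted vote per class
    return scores
```

MTN replaces this fixed aggregation with a transformer that learns how to combine the retrieved neighbors, which is what lets the memory contribute to the prediction itself rather than only to training.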
Verbs in Action: Improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact
with each other and the environment through space and time. Recently,
state-of-the-art video-language models based on CLIP have been shown to have
limited verb understanding and to rely extensively on nouns, restricting their
performance in real-world video applications that require action and temporal
understanding. In this work, we improve verb understanding for CLIP-based
video-language models by proposing a new Verb-Focused Contrastive (VFC)
framework. This consists of two main components: (1) leveraging pretrained
large language models (LLMs) to create hard negatives for cross-modal
contrastive learning, together with a calibration strategy to balance the
occurrence of concepts in positive and negative pairs; and (2) enforcing a
fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art
results for zero-shot performance on three downstream tasks that focus on verb
understanding: video-text matching, video question-answering and video
classification. To the best of our knowledge, this is the first work to
propose a method that alleviates the verb understanding problem rather than
simply highlighting it
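The cross-modal contrastive component above can be sketched as an InfoNCE-style loss where the negatives are hard captions (e.g. the same sentence with the verb swapped). This is a minimal sketch under that assumption; the real framework also includes the LLM-based negative generation, calibration, and the verb phrase alignment loss:

```python
import numpy as np

def info_nce_with_hard_negatives(video, pos_text, neg_texts, tau=0.07):
    """Contrastive loss for one video embedding against its positive
    caption embedding and a list of hard-negative caption embeddings.
    All inputs are assumed to be L2-normalised vectors."""
    logits = np.array([video @ pos_text] + [video @ n for n in neg_texts]) / tau
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                      # positive sits at index 0
```

Because the negatives differ from the positive mainly in the verb, minimizing this loss pushes the model to encode the action, not just the nouns.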
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Unsupervised image representations have significantly reduced the gap with
supervised pretraining, notably with the recent achievements of contrastive
learning methods. These contrastive methods typically work online and rely on a
large number of explicit pairwise feature comparisons, which is computationally
challenging. In this paper, we propose an online algorithm, SwAV, that takes
advantage of contrastive methods without computing pairwise
comparisons. Specifically, our method simultaneously clusters the data while
enforcing consistency between cluster assignments produced for different
augmentations (or views) of the same image, instead of comparing features
directly as in contrastive learning. Simply put, we use a swapped prediction
mechanism where we predict the cluster assignment of a view from the
representation of another view. Our method can be trained with large and small
batches and can scale to unlimited amounts of data. Compared to previous
contrastive methods, our method is more memory efficient since it does not
require a large memory bank or a special momentum network. In addition, we also
propose a new data augmentation strategy, multi-crop, that uses a mix of views
with different resolutions in place of two full-resolution views, without
increasing the memory or compute requirements much. We validate our findings by
achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as
surpassing supervised pretraining on all the considered transfer tasks.Comment: NeurIPS 202
Weakly-Supervised Surgical Phase Recognition
A key element of computer-assisted surgery systems is phase recognition of
surgical videos. Existing phase recognition algorithms require frame-wise
annotation of a large number of videos, which is time and money consuming. In
this work we join concepts of graph segmentation with self-supervised learning
to derive a random-walk solution for per-frame phase prediction. Furthermore,
we utilize within our method two forms of weak supervision: sparse timestamps
or few-shot learning. The proposed algorithm enjoys low complexity and can
operate in low-data regimes. We validate our method by running experiments with
the public Cholec80 dataset of laparoscopic cholecystectomy videos,
demonstrating promising performance in multiple setups
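The random-walk idea above can be sketched as label propagation on a frame-similarity graph: sparse timestamp annotations are spread to every frame by repeated diffusion. This is a simplified stand-in for the paper's random-walk solution, with all names and parameters chosen for illustration:

```python
import numpy as np

def propagate_phases(similarity, timestamp_labels, n_phases, iters=50, alpha=0.9):
    """Diffuse sparse phase labels (frame -> phase id, -1 for unlabelled)
    over a frame-similarity graph until every frame gets a phase.
    `alpha` balances graph smoothness against the fixed annotations."""
    W = similarity / similarity.sum(axis=1, keepdims=True)  # row-stochastic walk
    Y = np.zeros((len(timestamp_labels), n_phases))
    for i, lbl in enumerate(timestamp_labels):
        if lbl >= 0:
            Y[i, lbl] = 1.0                                 # clamp annotated frames
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (W @ F) + (1 - alpha) * Y               # one diffusion step
    return F.argmax(axis=1)                                 # per-frame phase
```

With a good similarity graph, a single timestamp per phase is enough to recover a plausible per-frame segmentation, which is what makes the weak-supervision setting viable.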
Cystitis: antibiotic resistance is highly dependent on the patient's profile